Executive summary

TODO

Loaded packages

##  [1] "caret"       "lattice"     "crosstalk"   "corrplot"    "corrr"      
##  [6] "openxlsx"    "plotly"      "ggplot2"     "formattable" "tidyr"      
## [11] "dplyr"       "stats"       "graphics"    "grDevices"   "utils"      
## [16] "datasets"    "methods"     "base"

Data characteristics

The provided data is organized so that each patient has several rows. Each row describes a single point in time at which a certain group of parameters was measured. As a result, the data contains many NA values, both row-wise and column-wise.

Rows in the dataset   Columns in the dataset   Decisive attributes   First admission       Last discharge
6120                  84                       78                    2020-01-10 15:52:20   2020-03-04 16:21:51

Gender   Number of cases
Male     224
Female   151

Determining the correlation

To create a correlation matrix, all measurements of each patient have to be aggregated into a single row, so an aggregation method must be chosen for columns containing more than one value. The following block creates three data frames, each using a different aggregation method: mean, max and last. The “last” method means that only the most recent measurement is taken into consideration. These data frames are then used to create three correlation data frames with the corrr package, which makes it possible to skip the step of creating a correlation matrix and converting it into a data frame. In the following blocks and explanations I will refer to these as the “mean”, “max” and “last” correlations.
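The three aggregation methods could be sketched in base R roughly as follows. The column names (patient_id, time, ldh, hscrp), the toy values and both helper functions are assumptions made for illustration, and “last” is assumed here to mean the most recent non-missing value. The report’s own code is not shown and may differ.

```r
# Toy long-format data: several measurement rows per patient, with NAs.
# All column names and values are illustrative assumptions.
measurements <- data.frame(
  patient_id = c(1, 1, 1, 2, 2),
  time       = c(1, 2, 3, 1, 2),
  ldh        = c(300, NA, 500, 200, 180),
  hscrp      = c(NA, 40, NA, 5, NA)
)

# Most recent non-missing value; rows are sorted by time before this is used.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else x[length(x)]
}

# Collapse every patient to a single row using the given summary function.
aggregate_patients <- function(df, fun) {
  df <- df[order(df$patient_id, df$time), ]
  aggregate(df[c("ldh", "hscrp")],
            by  = list(patient_id = df$patient_id),
            FUN = fun)
}

by_mean <- aggregate_patients(measurements, function(x) mean(x, na.rm = TRUE))
by_max  <- aggregate_patients(measurements, function(x) max(x, na.rm = TRUE))
by_last <- aggregate_patients(measurements, last_non_na)
```

Each result has one row per patient, so it can be passed directly to a correlation function.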

The corrr library lets the analysis “focus” on a specific attribute, filtering out all correlations that do not involve it. This study aims to determine which attributes are associated with which outcome of the disease, so the focused attribute is “outcome”. The results are shown below as bar plots. To keep the plots readable, only correlations higher than 0.6 or lower than -0.6 are shown. Hover over a bar to see the precise value of the correlation.
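The filtering step can be illustrated with a minimal base R sketch. It uses cor() rather than the corrr functions the report relies on, and the toy table, its column names and the 0/1 coding of the outcome are assumptions.

```r
# Toy per-patient table; column names and values are illustrative only.
patients <- data.frame(
  outcome     = c(0, 0, 0, 1, 1, 1),   # assumed coding: 0 = alive, 1 = dead
  ldh         = c(200, 250, 220, 600, 700, 650),
  lymphocytes = c(30, 28, 25, 5, 8, 6),
  sodium      = c(140, 138, 141, 139, 140, 142)
)

# Pairwise-complete Pearson correlations, so NAs do not wipe out whole columns.
cor_matrix <- cor(patients, use = "pairwise.complete.obs")

# Keep only the correlations with the outcome, excluding outcome itself.
with_outcome <- cor_matrix["outcome", colnames(cor_matrix) != "outcome"]

# Retain strong correlations only, using the same threshold as the plots.
strong <- with_outcome[abs(with_outcome) > 0.6]
strong[order(strong)]
```

In this toy data, ldh and lymphocytes pass the threshold while sodium is filtered out.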

The correlation plots show that, regardless of the aggregation method, the same group of attributes is most strongly correlated with the outcome. There are some differences, but overall the same attributes appear in all three plots. The following analysis therefore focuses mostly on neutrophils (percentage), fibrin degradation products (D-dimer is a subtype, so it is not included separately), lactate dehydrogenase, high-sensitivity C-reactive protein, calcium, prothrombin activity, albumin and lymphocyte percentage.

Analysis of the selected attributes

This section presents several interactive plots. For visualization purposes the timestamp of each measurement was normalized: it is the difference between the actual measurement time and the time of the first measurement that the given patient had. As a result, the Normalized_time variable contains the number of hours that have passed since the patient’s first examination. This approach makes it possible to visualize and compare the course of a given attribute across many patients on a single plot.
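A minimal base R sketch of this normalization is shown below. The column names patient_id and measured_at and the timestamps are assumptions; only Normalized_time is taken from the report.

```r
# Toy measurements; patient_id and measured_at are assumed column names.
measurements <- data.frame(
  patient_id  = c(1, 1, 1, 2, 2),
  measured_at = as.POSIXct(c("2020-01-31 08:00:00", "2020-01-31 20:00:00",
                             "2020-02-02 08:00:00", "2020-02-10 09:30:00",
                             "2020-02-11 09:30:00"), tz = "UTC")
)

# First measurement time of each patient, repeated on every row of that patient.
first_seen <- ave(as.numeric(measurements$measured_at),
                  measurements$patient_id, FUN = min)

# Hours elapsed since the patient's first measurement (POSIXct counts seconds).
measurements$Normalized_time <-
  (as.numeric(measurements$measured_at) - first_seen) / 3600
```

Every patient’s series then starts at hour 0, which is what allows the curves to be overlaid on one plot.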

Neutrophils percentage

In a healthy person, neutrophils should make up about 55-70% of white blood cells. The plot shows that deceased patients had a very high percentage of neutrophils throughout the whole course of their treatment. For the patients who survived, the percentage of neutrophils was either within this healthy range or decreased during the treatment.

High-sensitivity C-reactive protein

This plot shows extremely chaotic data for the deceased patients. There is practically no trend, and little more can be said about this data except that the hsCRP levels are quite high compared to those of the patients who survived. If only the Alive patients are selected, hsCRP decreased over time in almost every case. hsCRP is a blood test that measures the level of inflammation in the body; it is used, for example, to assess the risk of heart disease or stroke. A high hsCRP value means strong inflammation, which is consistent with the fact that COVID-19 patients with high hsCRP died.

Fibrin degradation products

Fibrin degradation products (FDP) are blood components produced when clots break down. The FDP value is high after any thrombotic event. The chaotic data on the plot might indicate that the patients with high FDP (all of whom later died) suffered from some kind of blood dysfunction.

Lactate dehydrogenase

Lactate dehydrogenase is an enzyme present in almost every living cell. High levels (up to 4 times higher in deceased patients than in survivors) can indicate an early stage of a heart attack and are, in general, a negative prognostic factor.

Calcium

Lower calcium levels among deceased patients can indicate numerous things. Hypocalcemia, however, can lead to several muscle-related problems, such as tetany or even disrupted conduction in cardiac tissue. The effect of low calcium levels has been researched and can be read about in this article.

Prothrombin activity

Prothrombin is a coagulation factor, which means that its role is to manage the clotting process. Low levels of prothrombin activity are related to fibrin degradation products. The low levels of prothrombin activity that occurred among deceased patients can indicate problems with the clotting process.

Albumin

Albumin is the main protein in human blood, making up about 60% of all blood proteins. Its main role is to maintain proper oncotic pressure, which prevents water and electrolytes from leaking out of the blood vessels into the tissues. A healthy person should have an albumin level between 30 and 55 mg/ml of blood.

Lymphocyte percentage

Lymphocytes are, alongside neutrophils, one of the five kinds of white blood cells. Low levels of lymphocytes can indicate autoimmune diseases, AIDS or other infectious diseases.

Classification

The dataset for the classification problem cannot contain NA values if Random Forest is used as the training method. Because of that, only a few columns were chosen for the classification problem:

  • Lymphocyte percentage
  • Neutrophils percentage
  • High-sensitivity C-reactive protein
  • Lactate dehydrogenase
  • Albumin

These are the attributes that showed the highest correlation with the outcome, as shown in the “Determining the correlation” section.
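Selecting these columns and dropping incomplete rows might look like the base R sketch below. The predictor names follow the variable importance listing later in the report, while the data frame name, the outcome column and the toy values are assumptions.

```r
# Toy per-patient table; values are illustrative assumptions.
patients <- data.frame(
  outcome               = factor(c("Alive", "Dead", "Alive", "Dead")),
  lymphocyte_percent    = c(30, 5, NA, 8),
  neutrophils_percent   = c(60, 92, 65, 90),
  hsCRP                 = c(3, 150, 10, NA),
  Lactate_dehydrogenase = c(200, 700, 230, 650),
  albumin               = c(40, 28, 38, 30)
)

predictors <- c("lymphocyte_percent", "neutrophils_percent", "hsCRP",
                "Lactate_dehydrogenase", "albumin")

# Keep only the chosen columns, then drop every patient with a missing value.
model_data <- patients[, c("outcome", predictors)]
model_data <- model_data[complete.cases(model_data), ]
```

Restricting the table to a few well-populated columns before calling complete.cases() keeps more patients than filtering on all columns would.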

## Size of the training set:  247
## Size of the testing set:  104

Training and predicting without parameter optimization

  • Control parameters for the train function:
    • Method: repeatedcv (repeated cross-validation)
    • Number of folds: 2
    • Number of complete sets of folds to compute: 5
  • The train function parameters:
    • Method: Random Forest
    • Number of trees: 10
## Random Forest 
## 
## 247 samples
##   5 predictor
##   2 classes: 'Alive', 'Dead' 
## 
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9684173  0.9363620
##   3     0.9651915  0.9297891
##   5     0.9538618  0.9068729
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive    54    1
##      Dead      3   46
##                                           
##                Accuracy : 0.9615          
##                  95% CI : (0.9044, 0.9894)
##     No Information Rate : 0.5481          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9226          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##               Precision : 0.9818          
##                  Recall : 0.9474          
##                      F1 : 0.9643          
##              Prevalence : 0.5481          
##          Detection Rate : 0.5192          
##    Detection Prevalence : 0.5288          
##       Balanced Accuracy : 0.9630          
##                                           
##        'Positive' Class : Alive           
## 
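As a worked check, the headline statistics can be recomputed by hand from the four counts in the confusion matrix above, with “Alive” as the positive class. The variable names are illustrative.

```r
# Counts from the confusion matrix above; "Alive" is the positive class.
tp <- 54  # predicted Alive, actually Alive
fp <- 1   # predicted Alive, actually Dead
fn <- 3   # predicted Dead, actually Alive
tn <- 46  # predicted Dead, actually Dead

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)
```

The rounded values agree with the reported Accuracy (0.9615), Precision (0.9818), Recall (0.9474) and F1 (0.9643).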

Training and predicting with parameter optimization

  • Control parameters for the train function:
    • Method: repeatedcv (repeated cross-validation)
    • Summary function: twoClassSummary
    • Number of folds: 2
    • Number of complete sets of folds to compute: 5
  • The train function parameters:
    • Method: Random Forest
    • Metric: ROC
    • Number of trees: 30
    • Tune grid: 1:5
    • Pre-processing: center, scale
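With caret, the settings listed above would translate into something like the following sketch. The helper fit_tuned_rf and its train_df argument are hypothetical, and classProbs = TRUE is an assumption added because twoClassSummary needs class probabilities. The report’s actual call is not shown.

```r
library(caret)

# Repeated 2-fold cross-validation with 5 repeats. Class probabilities are
# switched on (assumption) so twoClassSummary can report ROC, Sens and Spec.
ctrl <- trainControl(
  method          = "repeatedcv",
  number          = 2,
  repeats         = 5,
  classProbs      = TRUE,
  summaryFunction = twoClassSummary
)

# Candidate values for mtry (number of predictors tried at each split).
grid <- expand.grid(mtry = 1:5)

# Hypothetical helper: train_df is assumed to hold the factor column
# `outcome` (levels Alive/Dead) and the five predictors.
fit_tuned_rf <- function(train_df) {
  train(
    outcome ~ .,
    data       = train_df,
    method     = "rf",
    metric     = "ROC",
    trControl  = ctrl,
    tuneGrid   = grid,
    preProcess = c("center", "scale"),
    ntree      = 30   # passed through to randomForest
  )
}
```

Calling fit_tuned_rf() on the training set would return a model object whose printed summary has the same layout as the output that follows.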
## Random Forest 
## 
## 247 samples
##   5 predictor
##   2 classes: 'Alive', 'Dead' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 124, 123, 124, 123, 123, 124, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   1     0.9877812  0.9630597  0.9517857
##   2     0.9883376  0.9674495  0.9732143
##   3     0.9906038  0.9645083  0.9732143
##   4     0.9877185  0.9630597  0.9571429
##   5     0.9868392  0.9615672  0.9553571
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Alive Dead
##      Alive    55    1
##      Dead      2   46
##                                         
##                Accuracy : 0.9712        
##                  95% CI : (0.918, 0.994)
##     No Information Rate : 0.5481        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9419        
##                                         
##  Mcnemar's Test P-Value : 1             
##                                         
##               Precision : 0.9821        
##                  Recall : 0.9649        
##                      F1 : 0.9735        
##              Prevalence : 0.5481        
##          Detection Rate : 0.5288        
##    Detection Prevalence : 0.5385        
##       Balanced Accuracy : 0.9718        
##                                         
##        'Positive' Class : Alive         
## 

After parameter tuning, accuracy is 1 percentage point higher, the Kappa value is 0.02 higher, and the remaining measures are the same or higher than before. Because of the very high accuracy of the Random Forest method, no other methods were tested.

High precision and high recall both mean that the classifier performs well, since it returns few false positives and false negatives. Failing to detect ill people can, however, be quite problematic, since it could increase the strain on the medical system even further.

Importance of the attributes of the final model

## rf variable importance
## 
##                       Overall
## Lactate_dehydrogenase  74.610
## hsCRP                  27.151
## neutrophils_percent    15.245
## lymphocyte_percent      3.140
## albumin                 2.119

The trained model shows that lactate dehydrogenase levels have the largest impact on predicting whether a patient will die. High-sensitivity C-reactive protein is less than half as important, and the neutrophils percentage comes third. This outcome is confirmed by the article from which the dataset was downloaded.